Datastore
A datastore is a reference to an existing storage account on Azure. When you refer to data in a datastore, the data itself resides in the underlying storage service, for example Azure Blob Storage or Azure Data Lake Storage.
Datastore auth:
- Credential-based: Use a service principal, shared access signature (SAS) token or account key to authenticate access to your storage account.
- Identity-based: Use your Microsoft Entra identity or managed identity.
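As a rough illustration of the two authentication options, the sketch below builds a blob datastore entity with the Azure Machine Learning Python SDK v2 (the azure-ai-ml package). The helper name build_blob_datastore and all argument values are hypothetical, and the helper only constructs the entity; it assumes that omitting credentials yields an identity-based datastore.

```python
from typing import Optional


def build_blob_datastore(
    name: str,
    account_name: str,
    container_name: str,
    account_key: Optional[str] = None,
):
    """Build (but do not register) a datastore entity for a blob container.

    Hypothetical helper for illustration; not part of the SDK.
    """
    # Imported inside the function so this sketch loads even where the
    # azure-ai-ml package is not installed.
    from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore

    if account_key is None:
        # Identity-based: no secret is stored with the datastore.
        return AzureBlobDatastore(
            name=name,
            account_name=account_name,
            container_name=container_name,
        )

    # Credential-based: the account key is stored with the datastore.
    return AzureBlobDatastore(
        name=name,
        account_name=account_name,
        container_name=container_name,
        credentials=AccountKeyConfiguration(account_key=account_key),
    )
```

Registering the entity in a workspace is a separate step: pass it to `create_or_update` on an authenticated `MLClient`. A SAS token could be supplied the same way through `SasTokenConfiguration` instead of an account key.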
Datastores are abstractions for cloud data sources that simplify how data is managed and accessed in Azure Machine Learning. Creating a datastore stores the connection and authentication details in your workspace, so you don't need to hard-code this sensitive information in your scripts. Using the azureml:// protocol, you can then access data in the underlying containers by referring to the datastore. No separate authentication is needed, because Azure Machine Learning uses the stored connection information to access the data securely.
A datastore also creates a protective layer when you want users to use the data but not connect to the underlying storage service directly.
The benefits of using datastores are:
- Easy-to-use URIs to your data storage.
- Easier data discovery within Azure Machine Learning.
- Secure storage of connection information, without exposing secrets and keys to data scientists.
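To make the "easy-to-use URIs" point concrete, here is a minimal sketch that builds and parses a datastore URI of the form azureml://datastores/<datastore_name>/paths/<folder>/<file>. The functions datastore_uri and parse_datastore_uri, and the datastore name and path used, are hypothetical and not part of the Azure Machine Learning SDK.

```python
import re
from typing import Tuple

# Pattern for the datastore URI form:
#   azureml://datastores/<datastore_name>/paths/<path>
_DATASTORE_URI = re.compile(
    r"^azureml://datastores/(?P<datastore>[^/]+)/paths/(?P<path>.+)$"
)


def datastore_uri(datastore_name: str, path: str) -> str:
    """Build a datastore URI from a datastore name and a relative path."""
    return f"azureml://datastores/{datastore_name}/paths/{path.lstrip('/')}"


def parse_datastore_uri(uri: str) -> Tuple[str, str]:
    """Split a datastore URI into (datastore_name, relative_path)."""
    match = _DATASTORE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a datastore URI: {uri!r}")
    return match.group("datastore"), match.group("path")


# Hypothetical datastore name and path, for illustration only.
uri = datastore_uri("my_blob_store", "training/data.csv")
print(uri)
print(parse_datastore_uri(uri))
```

Note that the URI carries only a datastore name and a path. The account name, container, and credentials stay in the workspace, which is why scripts that use such URIs contain no secrets.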
Data Asset:
In Azure Machine Learning, data assets are references to where the data is stored, how to get access, and any other relevant metadata. You can create data assets to get access to data in datastores, Azure storage services, public URLs, or data stored on your local device. A data asset can be passed as either an input or an output of an Azure Machine Learning job.
Types of data assets:
- URI file: Points to a specific file.
  - Local: ./<path>
  - Azure Blob Storage: wasbs://<account_name>.blob.core.windows.net/<container_name>/<folder>/<file>
  - Azure Data Lake Storage (Gen 2): abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>
  - Datastore: azureml://datastores/<datastore_name>/paths/<folder>/<file>
- URI folder: Points to a folder.
- MLTable: Points to a folder or file, and includes a schema to read as tabular data.
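The path formats listed above differ mainly in their URI scheme, so a path can be classified by inspecting that scheme. The following is a small sketch using only the Python standard library; the function storage_kind and the sample account and container names are hypothetical, and Windows drive paths such as C:\data are not handled.

```python
from urllib.parse import urlparse

# URI scheme -> where the data lives, following the path formats above.
_SCHEME_TO_KIND = {
    "wasbs": "Azure Blob Storage",
    "abfss": "Azure Data Lake Storage (Gen 2)",
    "azureml": "Datastore",
    "http": "Public URL",
    "https": "Public URL",
}


def storage_kind(path: str) -> str:
    """Return the kind of storage a data asset path points to."""
    scheme = urlparse(path).scheme
    if not scheme:
        # No scheme at all, e.g. ./data/file.csv
        return "Local"
    if scheme not in _SCHEME_TO_KIND:
        raise ValueError(f"unrecognized scheme {scheme!r} in {path!r}")
    return _SCHEME_TO_KIND[scheme]


# Hypothetical example paths.
for example in (
    "./data/file.csv",
    "wasbs://myaccount.blob.core.windows.net/mycontainer/folder/file.csv",
    "abfss://myfs@myaccount.dfs.core.windows.net/folder/file.csv",
    "azureml://datastores/my_blob_store/paths/folder/file.csv",
):
    print(storage_kind(example), "<-", example)
```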
The benefits of using data assets are:
- You can share and reuse data with other members of the team, so they don't need to remember file locations.
- You can seamlessly access data during model training (on any supported compute type) without worrying about connection strings or data paths.
- You can version the metadata of the data asset.
When you create a data asset and point to a file or folder stored on your local device, a copy of the file or folder will be uploaded to the default datastore workspaceblobstore. You can find the file or folder in the LocalUpload folder. By uploading a copy, you'll still be able to access the data from the Azure Machine Learning workspace, even when the local device on which the data is stored is unavailable.
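A data asset that points to a local file might be defined as in the sketch below, which uses the Data entity and AssetTypes constants from the Azure Machine Learning Python SDK v2 (azure-ai-ml). The helper build_uri_file_asset and its argument values are hypothetical, and the helper only constructs the definition; nothing is uploaded at this point.

```python
def build_uri_file_asset(name: str, path: str, version: str = "1"):
    """Build (but do not register) a URI file data asset definition.

    Hypothetical helper for illustration; not part of the SDK.
    """
    # Imported inside the function so this sketch loads even where the
    # azure-ai-ml package is not installed.
    from azure.ai.ml.constants import AssetTypes
    from azure.ai.ml.entities import Data

    return Data(
        name=name,
        version=version,  # the asset's metadata is versioned
        path=path,  # e.g. a local path such as ./data/file.csv
        type=AssetTypes.URI_FILE,
        description="Example URI file data asset.",
    )
```

The upload described above would happen only when the definition is registered, for example by passing it to `ml_client.data.create_or_update(...)` on an authenticated `MLClient`.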